2015-04-30
----------

Quick thought. If we have -d, then parse.c includes parse.h, which
includes the prologue. But the prologue is also repeated in parse.c.
This tends not to work.

2013-11-10
----------

Things to do for release:

* update the README file (and NEWS)
* work out which documentation output files to include (.pdf, .info, etc)
* consider doing previous step on F19 (vps?) so unicode works - it doesn't
* fiddle with MANIFEST till the release has all the right stuff in it
* make sure that "make clean" works
* ensure that pacc0.c is up to date
* tag the release
* remake the docs

2012-05-14
----------

Mainly this is now going on the website. However, I did want to record
here that the fonts in the logo are Overlock Bold and URW Palladio L
Italic.

2012-05-09
----------

Couple of things we can do here. We could have a node type ``coord``
that holds int[2]. We could have node types of ``line`` and ``column``
that both hold an int. We could have a single node type of type ``num``
that holds an int, and distinguish them by their position. On the whole,
the first seems easiest. Especially since I've decided that it's OK to
use an anonymous union (supported in C11 apparently as well as ganuck
and others).

Almost there, but still some minor hitch with positioning, and one of
those darned pacc2 vs pacc3 errors... which was easily fixed.

2012-05-06
----------

Guards! Guards! They don't have #line numbers yet.

Well that's something I wouldn't have anticipated. First #line
implementation produces code that looks like this:

          cur->value.u106= (
#line 7 "bad/line.pacc"
           ++ 
#line 332 "parse.c"
          );

But in this case, gcc doesn't spot the syntax error till it sees the
closing brace, so we need to put that on the same line as the
expression. Hmm... and now it says

bad/line.pacc:7:15: error: expected expression before ‘)’ token

which is a touch confusing, but I don't think there's much I can do
about it. The column numbers are wrong, too, which will be fiddly to
fix, sigh. Oh, here's a thought. Instead of tediously using snprintf
(twice) to turn a line number into a n->text node, if the AST actually
supported numeric nodes, we could carry the line and column numbers
through as numbers. Then the line numbers trivial to turn into a string
(because we're writing to a stream, rather than memory), and the column
number can be used as a number to indent appropriately.

2012-05-05
----------

When we report an error, we calculate the line number (and column
number) in the input by counting '\n's. Which is all right as far as it
goes, but we don't want to keep doing that every time we encounter an
expression.

OK, I don't know why I've been holding off on this (except perhaps that
there's this vast body of code that I barely understand). We need
_pacc_line_number(_pacc_parser *, off_t). Initially use the code we've
got, counting '\n's. Later we can memoize it.

(I'm worry about off_t again; wondering if we wouldn't be better off
with intmax_t or intptr_t or int_fast64_t. The problem is, I need to
write some code that will convert a line number into a string. Oh, but
hang, they're longs anyway. OK, no problem.)

Right, well, that all seems to work *except* there's discrepancies
between pacc2.c and pacc3.c on node numbers. Right, because pacc1
doesn't build line numbers into the tree. So let's fix that. Done.

2012-05-02
----------

So, onto how to emit sensible #lines for the user-supplied text. At the
moment, we have nodes in the parse tree like "expr 102: 5". Either we
change those to include the line number where that expr came from, so we
might have "expr 102: 7 5", or we add another node type.

2012-04-30
----------

So this is a major rewrite of emit.c, and I've got to error(), which is
the major cause of bloat in the emitted code. For example, pacc2.c is
currently 11937 lines. I suppose now would be as good a time as any to
fix that up. Well, it drops to 9331 lines, which is a > 20% reduction,
so that's actually not bad.

Reworking everything to go through c_str() and friends is complete. I
could add the #line directives that go back to where we were, which was
the whole point of that rewrite.

Argh, so that's full of hideousnesses. First, #line is supposed to be
left-justified, but that doesn't match the way indents work at the
moment. Second, we need to know how long the template is. Thirdly, it
means that the "diff" to check pacc2.c versus pacc3.c fails miserably.
All solved.

2012-04-28
----------

So what about implementing #line directives? First thing is that we need
to get line numbers into the AST. And I'm afraid to say that I really
suspect the easiest (and possibly best) way to do this is to make an
`expr` node hold a value like

    6:"yes"

which will eventually turn into

   something_or_other = (
   #line 6 "parse.pacc"
   "yes"
   #line 943 "parse.c"
   );

These are both fiddly - it's irritating that the C pre-processor does't
have a "#unline" directive. The second "#line" will need to be on line
942 of file parse.c to tell it that the next line is 943 in parse.c!

To implement this, I'm going to expunge all the random printf()s in
emit.c. This will make removing redundant types from the union much
simpler, too.

2012-04-19
----------

I had a brilliant idea today. Why not write an RFC-822 (or whatever it
is these days) parser? Lots of sample input texts, many slightly
invalid. And I have lots of experience in the area. (Can I still find a
copy of TIM?)

Quick note on #line directives: they look like this:

    #line 23 "foobar.pacc"

2012-04-18
----------

10 months and 24 days. This is barely happening, is it? I spent a while
last night reviewing this Journal, and the ToDo file. I'm quite
staggered by how much work I've done on this (and how opaque most of
these notes are!). Some thoughts:

The Java parser seems like a bit of a red herring. Let's focus on making
a release.

In another project, I divided the todo list into "blockers" and
"laters". Let's try that here.

Blockers:

- add #line directives
- implement --feed / partialization (this is my unique selling point)
- complete command line interface
- detect left recursion
- one instance of each type per union
- dynamically allocate everything
- UTF-8 support
- basic documentation

Laters:

- void types
- autoconfiscation, because I really want to make at least one public
  release with my beautiful pristine makefile
- parser optimization
- value inheritance
- XXX cleanups
- review
- more research
- permissions (since I won't actually distribute the Java parser yet)

Now done:

- Scoping

    Names in pacc grammars are supposed to scope statically over
    sequences, but that clearly isn't the case at the moment, otherwise
    test/pacc/scope1.pacc would compile!

First up would seem to be #line directives. As far as I can see, we
include user-supplied in emit.c::emit_expr(), ::guard_post(), and that's
all. However, where do we get the information from to go in the #line
directive? As far as I can see, we're not storing that anywhere at
present, so we'll need to fix up the Result rule to store that somehow
(a new node type?)

2011-05-24
----------

Plan: keeping a copy of java.pacc as it now is, remove all semantic
actions from the grammar. This should give us a recogniser for valid
utterances in the Java language, although it won't have a clue what they
mean.

Oh, here's something I came across whilst doing this:

 QualifiedName :: {[Identifier]} =
	  i:Identifier is:(".":Sym i:Identifier -> i)*	-> {i : is}

I didn't realise you could write that in pappy. Currently the issue of
whether, and how, the * and + operators construct a list of values is
moot.

Some progress, but I'm tired. Really must get #line going, to make this
this sane to debug.

2011-05-22
----------

OK, so I have a supposed implementation of stack frames for scoping...
but now pacc1 cannot build pacc2 with a curious error. All the emit
tests work, so I think I have to devise one that doesn't, but that's
tricky since I don't know what's going wrong.

Managed to spot a problem and add an assert that flags it up. Many of
the emit test now *do* fail, where they had insufficient doses of "seq".
We're more sensible now that bindings are scoped only over a sequence,
so for example each leg of an alt must be a seq. The code emitted for
"alt" no longer produces anything that can mask this error.

Ugh. I'd forgotten that I'd implemented dequery. It's great, but I'm now
saying that seq is always the (only) child of a rep. This breaks
dequery. Sure, I can fix it, but is all this insistence on imagined seqs
everywhere the right way to go? After all, if there is only a single
child of a rep, it can't possibly be binding anything.

Oh wow. After some considerable time failing to get anywhere, I replaced
the nasty <frame> markers in emit.c with a frame stack. As well as being
loads cleaner, it worked first time. And the number of failures has
dropped to 2, for the first time in ever-so-long. One of those is a
fantasy concerning error messages, and the other is Java. So that's
looking pretty good.

The Java parser is now parseable pacc, but the semantic actions are not
C, so we don't build a working parser. That means that generating #line
directives is the next thing to add.

2011-05-21
----------

Far too long. I have hundreds of threads to pursue, and not enough time.
I suspect I should publish, and see what occurs. For now, I'd like to
take a quick look around, see if there's any failing tests, just have a
hack.

OK, first thing I think "make" should include the third-stage compiler
check automatically, and "make check" should run the test suite. Done.

First test suite failure is a core dump in emit/mk-class0.c. What's that
about? Might just be an error in the test. Seems that it builds a bogus
parse tree, since nothing else says we have to handle a cclass without a
text. There's nothing that seems to be a corresponding failure using
pacc itself to parse the grammar. Yup, sorted.

Down to 5 fails. First is pacc/class2.pacc. This is a case that fails to
parse, and we disagree with the error message. I think pacc is right,
fixed by fixing the description of the test case. 4 fails.

There's pacc/err0.pacc. That's a fantasy test, to do with error
reporting. Let's handle that case, then. OK, nothing fancy, just echo
something after the failure, so less time is wasted.

Next is pacc/scope1.pacc. Aha! I was hoping to get to this. OK, so
there's cleanup of name handling to do in emit.c. For a start, s_ptr has
now gone. So we can definitely go down to a single push. Hang on. Done.
Oh, let's make the stack autogrow too. Done that, and I think what comes
next is obvious. Instead of resetting the stack, we just want to unpop
as many names as we've pushed for the current sequence. And where do we
save that? Hmmm...

2011-01-19
----------

On another project, I'm playing with various bits of GNOME, and I've
just noticed that the gobject-introspection package uses flex and bison.
I've no idea what it's parsing or why, but its's obviously a potential
candidate for a pacc parser instead.

2010-11-30
----------

What next? Fix character classes to support C escapes, and also the
special escape \]. (Note that we don't need to escape "-", since it
should work at the beginning, or end of a class, and also after any
range, for example "[A-Za-z-_]". Need tests for all these cases, of
course.) Rewrite pacc.pacc to use character classes. Copy new rules into
pacc0.c.

Hmm... making "-" work as expected was more effort than I'd expected. I
was bitten by the usual problem that PEGs never backtrack, even when you
expect that they might, and also that some of the low-level rules that
match escaped characters were of type void -- they were normally glued
together by a higher-level rule that used refs to get at the actual
string value. But for character classes we need to know the characters.
(Along the way, I think I spotted some problems with "#" comments
occurring inside rules that I should pursue.)

2010-11-24
----------

So basic character classes work, all the way from pacc.pacc to the
compiler! Still missing are inverted character classes, and proper
handling of character escapes.

(Inverted character classes are technically unnecessary, since "[^x]" is
equivalent to "![x].". Hmm... we could actually implement them that way
in pacc.pacc. But on further reflection I think it's easier -- actually
less code -- to carry the inversion down to emit, as I'd originally
envisaged.)

But first, some tidy up. I made a bit of a hash of syntax.c in the last
throes of making character classes work, and thinking about it I suspect
snoc is actually append (essentially, we don't distinguish atoms from
lists). Let's see... that seems fine. And negated character classes
work. Cool.

2010-11-22
----------

At the moment I'm thinking about character classes. It seems to me that
we must add at least two node types to the grammar: cclass introduces a
class; its children of type crange are a list that give the various
ranges within the class. The value of the cclass node itself is null for
a normal character class; anything ("^", say) for an inverted character
class.

The crange node's value consists of a type byte, which is ">", "=", or
"<", followed by the character in question. So for example the character
class [A-Z_] would appear in the tree as:

    cclass: (null)
      crange: ">A"
      crange: "<Z"
      crange: "=_"

Code emission is fairly obvious I think.

Also, I'm now more or less determined that we will support different
character encodings. The important ones I know of are byte (no character
encoding, each byte in the input represents itself), UTF-8, and UTF-16.
Mainly I just need 2 functions. One function returns the length in bytes of
the next character on the input (obviously this is always 1 for the byte
noncoding; more complex for UTF-8 / UTF-16), which is used by the any
matcher. The other function returns the next character on the input
decoded into an int (or UTF-32), for use in the character class matcher.

Supporting various combinations like "grammar is written in UTF-8 but
input is in UTF-16" will be a little more complicated, but I think I can
do it using function pointers in the struct _pacc_parser object. But
more of that later.

2010-11-13
----------

Thinking about better error reporting... we need a way to let rules that
parse multiline bracketed things (like C comments, or blocks) to save
where they saw the start of the comment. For example, I just left a
comment unclosed, and was told

    parse.pacc:16:2: expected C_Comment:*:1, or "*/"

where 16:2 is end of file. This is pretty nifty, but needs to also say
where the comment began.

Finally, redone pacc0.c to match the improved grammar, and "make check"
works again. Good night.

2010-11-13
----------

Hmm... I'm a little confused. Currently pacc.pacc says this:

    Rule1 
	← Epsilon r:Result → { s_kid(seq, r) }
	/ Epsilon → { s_kid(seq, 0) }

Obviously we want to change it to say this, instead:

    Rule1 
	← Epsilon r:Result? → { s_kid(seq, r) }

So here's a tiny test case:

    S ← x:Five? → x
    Five ← "5" { 5 }

But that can't work, because the query is implemented via an iRule
(remember those?) and iRules are always :: ref_t. Darn.

It's actually almost there; the dump for this code (x is bound to S:x:0)
looks like this:

      rule 118: S:x:0 
	type 117: ref_t 
	seq 116: 
	  call 102: S:x:0:*:0 
	  expr 115: ref() 
      rule 127: S:x:0:*:0 
	type 126: int 
	alt 125: 
	  seq 124: 
	    call 123: S:x:0:1:0 
	  seq 122: 
      rule 121: S:x:0:1:0 
	type 120: int 
	call 101: Five 

So all we'd need to do is propagate the return in seq 124, add an "expr
0" to seq 122, and somehow make the binding to S:x:0:*:0 instead. (I
suspect that the int types of the last two iRules are "virtual voids",
but that could be fixed, too.) Hmm...

There's also the issue that although this might work reasonably well for
the "?" operator, it's not going to work for "*". If the rule was "S ←
x:Five* → x", we can return 0 for no matches, but what do we return if
it matches twice? In Haskell, "[5,5]". C has vectors, of course, and we
could make x :: int[], but we don't know the length of the vector.
Hmm... we could say that x :: (int *)[], with a null pointer to mark
the end of the list, but that throws up a lot of memory management
issues. We can't foist all this on the poor user.

Another idea, that I think I've had before, is to give rules a way to
say how they should be combined. For example, if "Five*" is supposed to
return the sum of all the 5s it matches, we might say

    Five ← "5" { 5 } { 0 } { a + d }

and the whole thing is evaluated like foldr. The "a" and "d" are bound
by the compiler. Or what about saying that the compiler will call the
functions

    int Five_zero(void)
    int Five_fold(int, int)

to evaluate the list (perhaps only if it's seen those names in the
preamble).

Obviously this will all have to be thought about much more. I do think
that the "match or 0" effect for "?" should be done. It's interesting
that query is in the end quite different from "*". OK, so this is done,
although sugar.c is looking very ugly again.

2010-11-13
----------

So I really thought that I could give ε a precedence lower than
sequencing, so that

    Foo ε Bar

is a syntax error. The only snag is that currently results are members
of a sequence, so putting  ε below sequencing would forbid rules like
this.

    Foos ← f:Foo fs:Foos { cons(f, fs) } / ε { 0 }

I was going to bodge round this by saying that actually ε is shorthand
for "{ 0 }" (glossing over what happens with void rules), but imagine my
surprise to find this in pacc.pacc:

    TypeOptional ← ... / ε { s_stashed_type() }

So really, no that won't fly at all. (Good, 'cos the "{ 0 }" bodge was
nasty anyway.)

Now, baf doesn't have problems like this, as his results come after a
sequence, and are not part of it. I changed this to implement lmatch()
and rmatch(), but they went away back in February. So let's go back to
that; test/bad/result0.pacc currently succeeds. And now it doesn't. This
has introduced a "make check" failure, though. It just seems be
different node ids, but that's not supposed to happen. Hmm...

OK, here's another small dose of clarity. "make check" compares the
output of pacc1 with the output of pacc2, and thus identifies functional
differences between the hand-crafted and generated grammars. In this
case, pacc1 parses "ε { 5 }" to "seq / seq / expr 5", where pacc2
produces "seq / expr 5". It is this difference that "make check" is
picking up here. I need to fix this, but I want to get pacc.pacc better
before doing so.

By the way, I just implemented the "-d" switch to optionally dump the
parse tree before desugaring (use "--dump=post" to get a dump both
before and after desugaring). Shoulda done this ages ago; it's *so* much
nicer than commenting in the s_dump() calls in main.c, then having to
recompile, then getting an ultra verbose pacc that fails most of its
test suite.

Current tasks:

1. Make "?" and "*" return the value "0" if they matched nothing.
Modify pacc.pacc to exploit this.

2. Implement character classes.

2010-11-11
----------

A new type of test case: inputs that are in fact not valid pacc
grammars, for example

    S <- .**

The proposal is to put these in test/bad, and then rework run-all.sh,
*not* to know how to run these tests itself, but rather to call some
bad-specific run script. (Obviously, we will then use this new
abstraction to rework emit and java, too.)

OK, done. The test harness just became achingly beautiful, with
sprawling shell functions slashed to a fraction of their size, yet with
all their functionality maintained. Bourne shell can be good, when it's
done right. So that's 6 done.

What about 5? I can show you a grammar that incorrectly sequences
epsilon. And one that combines bound literals with repetitions in an odd
way. And of course pacc/scope1.pacc fails, although that may or may not
be related. 

In fact, the current grammar works surprisingly well. Still, that's no
reason not to improve it. :-)

2010-11-10
----------

Spent the day at the freezing market with my notebook to hand. I now seem 
to have several strands ongoing. Here is a summary; more details are in
this file or my notebook.

1. Investigating error reporting (see earlier entry).

2. Sorting out the cut'n'paste mess that sugar.c is becoming.

3. Refactoring error reporting to use the struct _pacc_parser "object";
it shouldn't be riveted to the parser engine itself as it is now.

4. Rewriting the grammar to express the relationships between different
bits of syntax in terms of precedence.

5. Recognising that those changes should be prompted by test cases that
demonstrate errors in the present grammar.

6. Realising that this is a new *type* of test case: a grammar which is
meant to fail, and that the test harness can be extended (perhaps in a
reasonably clean way) to handle this new type.

7. And implementing character classes down to the compiler. (Nearly
forgot that one.)

OK. So that's a 1 2 3 6 5 4 7. Let's see how far we get...

Right, well the problem with number 1 is that I can't actually work out
what I want error reporting to do at the moment. I'll come back to that.

2a. Replace the three tree walks with a single one (but note that some
desugarings interact).

2b. Revoke the idea of creating unique rule names fit for human
presentation. Create unique rule names, and use the imaginary error
syntax to create good errors.

2c. Implement the imaginary error syntax.

OK, so after a long struggle (I think there are still problems with bits
of test suite not being rebuilt when they should be), I have tidied up
sugar.c as much as I can for now: 2a and then some. We may reconsider 2b
and 2c later on.

3 is done, as well as some additional tidying up around the mess left by
the stack reworkings. The rest must wait now.

2010-11-10
----------

Darn. So I worked out why the comment test fails: the new system for
generating names for desugared reps would produce a name like
C_Comment:* for the first rep in the rule, and the same name again for
the second rep! Hilarity ensues...

This is a bit of an ugly bugger to fix. Obviously we can readily make
the names unique again, for example by reintroducing the id as part of
the name for invented rules. But this means we get rules with names like
Word:129:*, which is not exactly pretty again. (Although still a slight
improvement on x.129!) Could we make up names like C_Comment:*:0 for the
first desugared rep?

Well, yes, of course we can, although the code gets less and less
pretty. Still, it's all in the name of good public relations, or
usability as it's sometime known. We probably need to make similar
changes to the deblit code.

In fact, here we go: "parse.pacc:269:7 expected b.529". OK, so that
would like much nicer as "... expected NameStart:c", which is what we
now have. However, sugar.c is getting increasingly cut'n'pasted.

Also, this error is being produced because we don't support 'x'
character constants (and I don't intend to, either). But at this point
there are several other options. Why doesn't pacc report any other
possibilities? I feel another test case coming on...

2010-11-09
----------

So viewing the java work as an opportunity to smarten up usability, I
find I must change all the emit scripts, as the interface to parse has
acquired a new argument (darn).

OK, done. We're still a little way off handling file inputs in the test
suite, mind you. And there's one genuine new failure: my changes to
comment handling seem to have introduced a problem which is picked up by
pacc/comment0.pacc. Good! This is the point of the test suite, after
all.

Also, it occurs to me that error output should be handled by a separate
function that takes a struct _pacc_parser argument; parse() will be a
simple to use wrapper, that handles common cases, but if you want finer
control you can create a parser object directly. More thought needed
here.

2010-11-07
----------

Work on the stack is nearly complete. I thought I'd hit a real snag:
the savecol() / acceptcol() stuff saves and restores the "column
registers", which included the pointer into the column stack. Obviously
this seemed like a good idea at the time, but in fact simply removing
the pushes of col_ptr has no observable effect. Phew! I thought I was
going to have to work out what the hell that was all about, but since it
was about nothing, it won't use up too much brain power. Clean up time
now.

Back to looking at java.pacc. First thing it reveals is that we don't
handle "\r", or indeed most other C escapes.  The complete definition of
an escaped character is quite hairy, since it includes a 1-3 octal digit
sequence, as well as various hex sequences and unicode values. Currently
we don't support them all; this impacts pacc.pacc and more unpleasantly
emit.c, where we need to calculate the length of the named string (for
example, "\n\13x\177" names a string of 4 characters).

So what I'm currently doing is throwing baf's Java parser at pacc, and
fixing each failure as it occurs. It strikes me that this is a very good
way to get a user handle on pacc, and I could use this as an opportunity
to polish up error handling.

The two obvious problems at the moment are: invented rule names
("expected x.159") and the lack of line number resolution ("at column
6080"). So let's fix them, eh?

2010-10-23
----------

The better stack is coming slowly. It now works for pushing and popping
"cur", but there's a crying need for a macro to tidy up the nastiness.

2010-08-26
----------

So I was just about to deal with removing the m stack bodge, when it
struck me that we really could quite easily have a single stack. My
first effort involved a union type, but the snag with doing it that way
is that each stack slot is as large as the largest possible. No problem
at all on a typical 32 bit architecture, but I don't want to assume
that.

The right way to do it, of course, is to maintain the stack pointer as a
char *, and simply use sizeof to advance it the right amount. So that
now works for pushcont, but there's a problem with pushm; we use the m
stack pointer to know when we're done. Hmm... I guess we could just grab
the current stack pointer at that point; we're done when we get back to
there.

2010-07-07
----------

Reworked bound literals to fix a bug in the strncmp implementation.

Now, we really really need a serious test case parser. Porting baf's
Java parser still has to be the best option. Here goes...

We will have a test/java directory, containing java.pacc and any Java
source file we want to check. The test harness will be extended to
include "parse_file" and "noparse_file" commands; these correspond to
"parse" and "noparse", except the argument is the name of a file
containing the test input, not the entire test input itself!

First snag... feeding a very broken java.pacc to pacc produces "st_stack
overflow". So let's fix st_stack bodge. OK, done, and the error is now
the only slightly more useful "expected Char at column 3600", but there
you go.

Blimey! Java.pappy uses "=" to introduce a rule's definition. So let's
add that to pacc's idea of what an "lArrow" is. By the way, if pacc
says "expected Char at column 4212", then you can see where that is:

    dd 'bs=1' 'count=4212' <java/java.pacc

OK, now, with a bit of hackery, we're up to "out of m stack space",
and I'm out of time. So removing the m_stack_bodge is where to start
next.

2010-05-31
----------

Next steps:

1. implement bound literals;

2. use them to port baf's Java parser;

3. with a decent-sized parser, and an enormous range of possible inputs,
go back to optimizing...

First part of step 1, write a test case. Current stats are 315 passes, 4
fails. Added test/pacc/blit0.pacc... and made it work!

Hmm... I think that pacc.pacc is not as clean as it could be. What,
really, is the difference between a Matcher and a PrimRule?

2010-05-15
----------

Thinking about "bound literals", which are heavily used in baf's Java
grammar to model reserved words:

PrimExpr :: Expression =
    ...
    / "this":Word		-> {ExpThis}
    / "super":Word		-> {ExpSuper}
    / ...

Incidentally, I claimed on 2010-01-15 that Java.pappy includes no
semantic expressions. I'm not sure what I was smoking, but this is
completely wrong: in fact baf builds a complete AST. My apologies.

In principle, and for orthogonality, we allow anything to the right of
the colon, although at present I can't think of a non-contrived use for
this.

    Trip <- Letter Letter Letter
    Foobar <- "foobar":(Trip Trip)

Anycase, the existing desugaring should handle this just fine, so we can
just consider "lit":Rule without loss of generality. And this is just
sugar for

    n:Rule &{ ref_strcmp(n, "lit") == 0 }

except that pacc will create a new name in its own _pacc namespace.

In the (unde)sugared tree, we need to add an optional child to lit,
which is the binding.

    lit: "if"
      call: "Word"

which becomes

    seq:
      bind: "_pacc_bl273"
	call: "Word"
      guard: "ref_strcmp(_pacc_bl273, \"if\") == 0"
	ident: "_pacc_bl273"

Hmm... we must insist that Word :: ref_t (we might be prepared also to
allow Word :: char * and select strcmp() when this is so).

What about this idea of analysing the tree to find out the maximum
possible evlis for each rule? I had imagined walking the tree at the
optimization stage, and couldn't see how to communicate the information
to emit(). But actually we could do it in rule_pre() as we are about to
emit each rule.

Insight! If we make this a little more abstract, so it can count the
maximum of any type of node, then we can raise an error if pathmax(expr)
is > 1 (or rule type is void and pathmax(expr) > 0).

int pathmax(struct node *n, enum s_type t) {
    int r = 0;
    struct s_node *p;
    assert(!s_has_children(t));
    if (n->type == alt)
	for (p = n->first; p; p = p->next) {
	    int m = pathmax(p, t);
	    if (m > r) r = m;
	}
    else if (s_has_children(n->type))
	for (p = n->first; p; p = p->next)
	    r += pathmax(p, t);
    else if (n->type == t)
	r = 1;
    return r;
}

There's no *guarantee* that this will save space, since a rule could
have a very large max which is very rarely reached. In this case we will
allocate many evlis entries that are unused. But it seems like it should
be a win almost all the time.

2010-04-30
----------

So here are some ideas for tightening up struct intermed further.

1. Move the status somewhere else. There isn't really room in col or
rem, but maybe expr_id?

2. It is inconceivable that ev_valid and ev_alloc would ever need to be
longer than a short. This saves 4 bytes on a 32 bit architecture. (I
wondered about having an "inconceivable(ev_valid == 0xffff)" macro, on
reflection I think is should just be assert().)

3. Insight! Instead of storing rule/col pairs on evlis, resolve them to
struct intermed * pointers. This saves 4 bytes, or 8 bytes if off_t is
64 bits. (OK, it's rather eager evaluation, but is surely worth it?
Would need to profile.) [ 2010-11-07: no, we *can't* do this, even if we
wanted to, because the struct intermed array can shift under our feet. ]

4. Only rules which are parsed need an evlis. Only rules which are
parsed and not void need an expr_id. Only rules which are parsed, not
void, and evaluated need a value union. Is it possible to exploit this
to save some space, but without having to chase pointers everywhere?

5. Static analysis of the tree would allow us to calculate the longest
evlis that each rule could possibly need, allocate that up front, and do
away with ev_alloc altogether.

2010-04-28
----------

OK, so expr_id is out of evlis, and into struct intermed. This seems to
open up huge possibilities for further cleanup. For example, since we
now always push a call/col pair onto evlis, the union can turn into a
struct, and things like "pos += 2" can become "++pos". Done.

And if I can eliminate the need to distinguish calls and binds, then all
the bit twiddling for enum thr can go too. Let's see: we'd need to
pushn() and pushs() dummy values when we compile a call the value of
which is *not* bound. Yes, that works. There's still the messy "binding"
global variable in the compiler, but the main point is that we can now
totally eliminate enum thr. Yup, done.

So, I think that really is just about everything sorted, with plenty
tidy up still to be done, of course. And what are the vital stats now?

    ; sh run-all.sh 
    315 passes
    4 fails
    ; wc pacc.pacc
     137  488 2852 pacc.pacc
    ; valgrind ./pacc2 pacc.pacc -o pacc3.c
    ==2641==     in use at exit: 989,559 bytes in 14,592 blocks
    ==2641==   total heap usage: 21,404 allocs, 6,812 frees, 1,769,503 bytes allocated

Oh darn, the total memory allocated has gone *up* slightly, to 620 bytes
per byte of input. That's annoying. The changes are still good, as they
are huge simplifications, but still...

2010-04-27
----------

Long gap. Sigh. To recap: next stage is to remove support for col_expr,
and tidy up around that. So far, the step stuff has gone, and
save_core() no longer takes a "cols" argument; it never saves any
columns.

Now I want to move thunks into struct intermed. Hmm... this is a bit
more complicated than I thought. For a start, the current evlis allows
any mingling of thunk calls and rule calls that may be needed. In the
new regime, we need to... call all the subrules first, and then the
thunk. I think.

Also, when we push a thunk onto evlis, we also push the current
rule_col. So if thunks move into struct intermed, it looks like we need
another column stored too. Surely this is wrong? Yes, indeed. The pushed
rule_col is always the same as the current col! So we can eliminate
that; I just did.

2010-03-03
----------

Fixed the emit/mk-nlg[01] tests. If you turn memoization off in emit.c,
these tests now consume enormous amounts of memory before failing. Some
of the versions of the baf calculator also become useless, as does
pacc.pacc itself. (In fact, they're *so* hopeless -- even an input like
"aaaaaac" completely fails -- that I wonder if something else is going
wrong? Better take another look when I get the chance...)

2010-03-02
----------

In haste: I have less than 5 minutes today. There is nothing arbitrary
about "only one expression". We require each possible path through a
rule to yield exactly one expression (or none for a void rule). We
should allow them to appear anywhere, so you can say something like
this.

  Term <- "(" ( n:Number -> n / v:Name { lookup_var(v) } ) ")"

But here's the main curiosity I wanted to get down. We can easily check
"exactly one expression" at run time: set p->expr to 0 when it is
initialized; when we know the id of the expression that we will want,
check that p->expr is 0 before setting it to that id.

We can, and will, also check statically that the parse tree obeys the
"exactly one expression" rule; but that's a somewhat hairier check, so
we will use assert() to do the run-time check. The check itself is
simple:

    assert(p->expr == 0);
    p->expr = eid; /* actually, eid is a compile-time constant */

And when we initialize a p, we say

    assert((p->expr = 0) == 0);

That assertion can never fail, but it is a *correct* use of a
side-effect in an assertion!

That's all folks...

2010-02-25
----------

Next step will be to remove support for col_expr. This will save some,
possibly quite a bit, of space. So here are the interesting statistics
before we do that...

    ; sh run-all.sh 
    315 passes
    4 fails
    ; wc pacc.pacc
     137  488 2852 pacc.pacc
    ; valgrind ./pacc2 pacc.pacc -o pacc3.c
    ==5986== malloc/free: in use at exit: 971,991 bytes in 16,553 blocks.
    ==5986== malloc/free: 38,253 allocs, 21,700 frees, 1,739,467 bytes alloc'ed.
    ; echo 1739467 / 2852 | bc
    609

Outta time. The variable col_expr has gone; need to tidy up the
encoding, and refactor.

2010-02-24
----------

So where are we at? It should now be possible to remove copy() and
match()... No, while it now compiles, it fails 69 tests (we "should"
currently be failing 4, although I think I have to do something about
that quite soon - it's really annoying to have "expected" unexpected
failures). So why's that? Oh. Some of the emit tests still use match().
Darn, I thought I'd fixed them all up. OK, here goes... done.

2010-02-23
----------

What I'm intending to do just now is remove support for col_expr,
match(), copy(), etc. That throws up a few ideas...

Before we had ref(), it was essential that expressions could appear
anywhere in a sequence, as that (the correct binding of col_expr,
basically) was how match() and rmatch() worked. This also meant that we
had to at least countenance the possibility of multiple expressions in a
sequence, although what this actually meant was rather unexplored.

However, we can now revert to the more usual idea that a sequence should
hold only one expression. (Indeed, it would seem to be the case that only
a top-level sequence - a child of a rule or a child of an alt that is a
child of a rule - should be permitted to contain expressions. But this
seems a bit arbitrary too.) Obviously I need to think about all this
some more, but it would be nice not to have to worry about what multiple
expressions mean. It might also mean that we have a better way to tell
where a rule ends (although then we might have to support "{}" or ";" to
indicate the end of void rule).

And another thought sparking off that: the current typing syntax is
basically: "same as a cast". But in fact we use the type for a
declaration, and for sufficiently complex types we may run into trouble.
If we could have "same as a declaration" life might be simpler.

But back to the main plot. Ah: pacc0.c still uses match(). Darn. OK, now
it doesn't. For good measure I also fixed a few missing seqs in the
definition of _ which meant that pacc1 couldn't parse everything that
pacc2 can; I'm not really much into the idea of sPEG any more.

2010-02-19
----------

Just to finish off what I was saying about "make check"... No, it is not
particularly fragile: under all normal circumstances a failure of "make
check" would mean that something has gone very wrong. In general, the
hand-crafted tree in pacc0.c must be able to parse all pacc syntax that
is *used* by pacc.pacc, although pacc.pacc may itself *implement*
further syntax.

Normally, provided this is the case (and everything else is working OK!)
pacc2 will be correctly built, and "make check" will succeed. If it is
not the case, pacc1 will be unable to parse pacc.pacc.

But comments and identifiers in embedded C code are a special case,
since it is not *really* pacc that has to parse them -- it just does a
rough job, and leaves the details up to the C compiler. Provided we pull
out at least enough identifiers, pacc1 can build pacc2 correctly. But if
pacc2 then comes up with a tighter list of identifiers, pacc3 will be
different. Phew!

So, better make pacc0 strip C comments... OK, done that. And now "make
check" works again.

2010-02-17
----------

I'm just about ready to strip out the col_expr code, the match() macro,
etc. since ref() and friends are now used in all test cases. However...

The "make check" command currently fails, because pacc2.c is compiled
from a tree that includes "identifiers" from a comment in some C code
and pacc3.c is compiled from a tree that doesn't. Now, pacc0 definitely
doesn't understand C comments, and pacc2 definitely does. But it seems
to me that pacc1 *should* (since it is compiled from pacc.pacc), but
*doesn't* (since when it builds the tree from which pacc2.c will be
compiled, it includes those identifiers).

Just a sec, we can easily prove whether pacc1 understand comments or
not...

    ; sh run-all.sh pacc/comment1.pacc 
    ...
    ; ../pacc1 parse.pacc 
    input is parse.pacc
    output is parse.c
    expected Char at column 1

No, it doesn't. Definitely not.

But wait! I think perhaps it is my understanding that is faulty. Let me
see: pacc0 doesn't contain a parser, so pacc1.c is *not* built from
pacc.pacc, it is built from whatever tree is hardwired into pacc0.c.
This is interesting: it means that "make check" is a bit more fragile
than I suspected. (It may be a Good Thing for a test to be fragile.)

2010-02-12
----------

Just fixed pacc.pacc to support C style comments in .pacc files. (That's
"/* */" and "// to eol", as well as "# to eol" sh style, so there are
*three* comment styles, which seems a touch excessive.) It took only
moments, which is why I'm such a big fan of PEGs.

Hmm... so pappy is a functional pearl? I guess that makes pacc a
procedural swine. :-)

2010-02-11
----------

Today's goal: make pacc.pacc understand non-call bindings. That was
easy: it already does. (Hmm... I'm mildly disturbed by this; I will need
to code review pacc.pacc carefully at some point.) As a side issue,
there is yet another test harness; modelled on the automation in
test/emit, test/pacc tests the complete pacc parser with far less boiler
plate. Eventually I need to move all the pacc tests to here.

So, it should be easy enough to make pacc.pacc work with ref_str. Here
goes. OK, that was easy too. So I think this means that I can make
match() entirely go away! Let's try it...

Stop & Think is all very well, but a stroll round town can produce more
thoughts (and good ones, too) than I can implement in a week. Sigh. Any
case, I'd like to document the ref() interface, and change it slightly,
and also tidy up the new test harness considerably.

Currently we have this.

ref_t
    An opaque type which can hold a reference to a substring of the
    input string.

ref_t ref()
    Returns a reference to everything that is matched by the current
    rule.

char *ref_ptr(ref_t r)
    Returns a pointer to the start of the substring referenced by r.

size_t(?) ref_len(ref_t r)
    Returns the length of the substring referenced by r.

char *ref_str(ref_t r)
    Returns a newly-allocated, NUL-terminated copy of the substring.
    Almost equivalent to strndup(ref_ptr(r), ref_len(r)), except that it
    is more portable, and it calls nomem() if memory cannot be
    allocated.

Now, the uses that are coming up time and again are "ref_str(ref())",
and less frequently "*ref_ptr(c)". I propose to rename ref_str() to
ref_dup() and add either ref_str() or match() to mean ref_dup(ref()).

Also, what about having ref_0(c) to mean *ref_ptr(c)? And in the future
I intend to have ref_cmp(ref_t r, char *s). That would give us the
following.

ref_t
    An opaque type which can hold a reference to a substring of the
    input string.

ref_t ref()
    Returns a reference to everything that is matched by the current
    rule.

char *ref_ptr(ref_t r)
    Returns a pointer to the start of the substring referenced by r.

char ref_0(ref_t r)
    Returns the first character of the substring referenced by r.

size_t(?) ref_len(ref_t r)
    Returns the length of the substring referenced by r.

char *ref_dup(ref_t r)
    Returns a newly-allocated, NUL-terminated copy of the substring.
    Almost equivalent to strndup(ref_ptr(r), ref_len(r)), except that it
    is more portable, and it calls nomem() if memory cannot be
    allocated.

char *ref_str()
    Equivalent to ref_dup(ref()).

int ref_cmp(ref_t r, char *s)
    (Not implemented yet.) Has the same value as strcmp(ref_dup(r), s),
    but allocates no memory.

Complete? Consistent? Well, not too bad. Let's see how we get on.

2010-02-10
----------

Hmm. Here's a couple of things to watch out for. First, with a binding
like r:"y"*, we will first debind and produce a rule like "b.123 ::
ref_t <- "y" * { ref() }". Then we will derep and produce something like
"b.123 :: ref_t <- x.123 { ref() }", where x.123 :: void.  This seems a
bit inefficient: in this case at least, we could simply tack the
"{ref()}" onto the end of x.123. But maybe it's not as bad as I think;
worry about it later.

More seriously, as things stand we are creating rules of type ref_t:
this will be stuffed into the huge union. But we don't want it there! It
may well be quite large (I'd expect off_t[2] to be 16 bytes on most
systems for the next 10 years, whereas almost everything else in the
union will be 8 or even just 4 bytes). More drastically, the values are
already in struct intermed for every rule. So we may have to handle
ref_t extra specially.

Ah, OK, so the second one isn't really a problem at all: we just
"typedef off_t *ref_t"; if the union wasn't already big enough to hold a
pointer, then you're not doing anything serious anyway!

OK, OK, it's all looking good; several emit examples now work using the
new ref() stuff. I realise that match() is equivalent to
ref_str0(ref()), except that it doesn't work for pacc.pacc! So why is
that? Ah, OK, of course. ref_str0(ref()) is equivalent to match() at the
end of the sequence. But in this rule we rely on the left-of-expression
behaviour of the old match().

    Name :: char * ← NameStart NameCont* { match() } _

As far as I can see, this is the *only* rule where ref_str0(ref()) won't
do. OK. We can't fix that till pacc0 knows how to parse a non-call
binding (and pacc.pacc, come to that); I did say this was a major
change! So let's get back to fixing emit tests; currently have 137
passes, 13 fails... Sorted! And not only that, but I fixed mk-regress0.c
to use references instead of match, and this saved 4 leaked memory
blocks (according to valgrind). There are still memory leaks, but the
fewer the better, eh?

2010-02-09
----------

OK, here's my first sticking point. Consider this rule:

    S :: char * <- "x" r:"y"* "x" -> r

For a start, we need to get the precedence right here: * binds more
tightly than :. But at the moment I'm building parse trees by hand; I
don't need to worry about this till I change pacc.pacc to support
non-call bindings.

More urgently, two types of sugar collide here. The derep code in
desugar turns "y"* into a pair of rules of type void (well, int
currently, but that will be fixed). At this point, the binding of y *is*
to a call, but it has the wrong type!

I think all we need do is debind before we derep. So this rule would
first become this pair of rules:

    S :: char * <- "x" y:Y "x" -> y
    Y :: ref_t <- "y"* { ref() }

And then be dereped as usual.

2010-02-06
----------

So I know -- more or less -- where I want to end up. This *is* a major
change, which will break almost everything. What's my plan of attack? I
think I'll start with desugaring non-call bindings, which won't impact
anything else. Errm, then we need to implement the interface to embedded
code (i.e. the replacement for match()) and finally rewrite pacc.pacc to
use it. I think.

2010-02-03
----------

Sigh. Mainly due to a complete lack of time, I've not managed to
implement anything at all for all this stuff. I'm also hoping that I
will finally hit on THE solution which makes everything go really
simply. One good reason for this is that it will be an incompatible
change, and I'll have to rewrite large chunks of tests, and presumably
pacc.pacc itself. But probably I'll have to try a few different ideas
first.

Yet another thought. If I'm saying, which I am, that we ought to support
"swapping" the input, then couldn't we drop all this off_t stuff? We
could just use size_t or int or long... no, no no! The point is that we
must use a type large enough to address the entire input - hence off_t.

So let's run it through one more time. Suppose we have a rule like this.

    QuotStr <- '"' < EscStr > '"' _

Ahem. It has just occurred to me that we could just as well say

    QuotStr <- '"' e:EscStr '"' _ -> e

Currently, it is not possible to have anything other than a call in a
binding, but I do intend to add sugar to make that possible. As another
shortcut, we could say that a binding to some particular name, or maybe
the symbol $, gives the value of the rule. (Perhaps it would be cleaner
to say something like this. It is an error if a sequence which is not
void fails to contain an expression which gives the value of the rule.
Exception: if a sequence contains no expression, but does contain a
single binding, the bound value is the value of the expression.)

What about this rule from pacc.pacc?

    Name <- NameStart NameCont* { match() } _

We're considering possibilities like these:

    Name <- < NameStart NameCont* > _ { ref() }   # bra/ket matching
    Name <- n:(NameStart NameCont*) _ -> n        # nothing special
    Name <- $:(NameStart NameCont*) _             # $ rule
    Name <- n:(NameStart NameCont*) _             # any binding rule

What about this example from earlier?

    QuotStr <- '"' < (!'"' . )* > '"' _

Noting that repetition operators bind more tightly than, err, binding,
we would write that as:

    QuotStr <- '"' q:(!'"' . )* '"' _ -> q

And the crux of it all:

    Char <- c:. -> c

So, blimey, all that stuff about marking matches is completely bogus.
OK, good. I'll only be slightly sorry to drop ⋄.

But there's still some work to be done! How do we extract references
(and char* copies) from a rule? Let's call a rule which is created by
the desugaring that handles a binding to something other than a call an
"iRule". We could say that the type of an iRule is always ref_t; if you
need a char* copy, you have to create it yourself (using ref_to_chars(),
or somesuch).

Do we actually need to create a new rule? Could we not revive the column
marking idea, but with the cleaner syntax of binding? Well, maybe,
but...

For a rule, we already know the start and end columns! (They are
currently called col and remainder in struct intermed.) Aaahh! Did I say
I was "hoping that I will hit on THE solution"? Well, this simple
factlet makes me suspect that I'm getting close! :-)

2010-01-22
----------

Some more thoughts on this range stuff. Could we avoid a struct
altogether, and say that a range is represented by off_t[2]? (Just at
the moment, pacc.pacc doesn't recognise square brackets as type syntax,
but that wouldn't be hard to fix.)

    Char :: off_t[2] <- < . >
    NameStart :: void <- c:Char &{ isalpha(pacc_byte(c[0])) || ... }

Also, if we have <> some way to stash the current column, there is
really no need ever to stash the column of an expr: if you need anything
other than the entire match, you have to mark it.

2010-01-21
----------

My thinking for struct pacc_str is wrong. I want to handle the case (or
at least not make it infeasible to handle the case) where sizeof(off_t)
> sizeof(char *). So we want something like this.

    struct pacc_range {
	off_t fst;
	off_t lst;
    }

Also, accesses to a range must be made via a function / macro (since,
potentially, we want to allow swapping of the input string). As a
strawman, let's call this pacc_byte(). So the archetypal case would be
something like this.

    Char :: struct pacc_range <- < . >
    NameStart :: void <- c:Char &{ isalpha(pacc_byte(c)) || pacc_byte(c) == '_' }

We urgently need to sketch out a more complete interface between the
pacc engine and the code fragments it holds.

2010-01-19
----------

OK, I'm thinking - yet again - about match(), and copy(), and also
call/lit matching. I'm trying to come up with a mechanism that meets the
following criteria.

1. It could be even more general than match() - it is hard to argue that
there's anything wrong with Piumarta syntax:

    QuotStr <- '"' < (!'"' . )* > '"' _

2. It should make it possible to handle call/lit matching as syntactic
sugar.

3. It should avoid the copying overhead (not to mention memory leaks!)
associated with match(). This implies using struct pacc_str, or
something similar to it. Although, with Chris Torek and Dan Bernstein, I
abhor typedefs that eliminate structs, maybe we need one here.

Here's what I'm thinking about at the moment.

Invent some syntax (maybe any of the characters < > <> ⋄) which means
"store the position of col". Then make all stored cols available to
exprs, perhaps in the variables pacc_col0, pacc_col1... Maybe if no cols
are stored, we have some defaults for the left, "middle" and right col.
(The middle is where the current expression is, of course.)

Make match() return a struct pacc_str, and copy() a char *. Finally,
season with some defaults: a missing expr is equivalent to
match(pacc_col0, pacc_col1) or copy(..), depending on the type of the
rule. So the quoted string example would go something like:

    QuotStr :: pacc_str <- '"' < (!'"' . )* > '"' _

The current match0.pacc says:

    A :: char * <- "5" "6" { match() } "7"

That would become

    A :: char * <- < "5" "6" > "7"

But I need some more convincing examples to finish off the details.

2010-01-15
----------

Some of Bryan Ford's stuff is now available in the git repo in baf/.
I've had various ideas about pappy-compatibility, either directly or via
a translator. However, this is not worth much. For a start, pappy is not
maintained, so there are unlikely to be many extant pappy grammars. And
even if there were, who would want to go from Haskell to C? Being
compatible with Rats! would make more sense (and I don't intend to do
that).

More seriously, we can't support or translate Haskell code, including
type definitions. So even if we don't care about semantic values, we
can't automatically translate pappy into pacc. I guess I'll have to
build my own Java.pacc - obviously it will be derived from Java.pappy.

Incidentally, I was right to assume that Java.pappy includes no semantic
expressions. I can achieve this effect with pacc.pacc by returning early
in pacc2.c. This saves only 20000 bytes, yielding 539 heap bytes per
input byte.

Here's a crazy idea: write a pacc grammar, not for es (which has a
somewhat broken back end, to say the least) but for rc. OK, I'd rather
spend serious shell hacking time on es, but rc would be a "real world"
client much sooner. Hmm.

Another, less crazy, idea:

    struct pacc_str {
	char *start;
	off_t length;
    };

We could then make match() yield a struct pacc_str, which would save a
lot of copying. Alternatively, the struct could hold two char *
pointers, start and end.

Also, I need to think whether I intend to support the syntax for
matching a call to a literal, as in:

    For <- "for":Word ...

This is much used in Java.pappy. It should be, maybe using struct
pacc_str, maybe not, to make this work without having to make a copy of
the Word.

2010-01-13
----------

As I mentioned before, space optimization is pretty much done; I can't
think of much else to squeeze. Except that...

    union { ... } *evlis;
    size_t ev_alloc;
    size_t ev_valid;

This is using a lot of space for something that is unlikely to be more
than a few elements long. It seems to me that the maximum length of an
evlis can be computed statically, by walking the tree and recording:

    for an expr, 3 (thunk id + 2 columns)
    for a call, 2 (call id + 1 column)
    for a seq, the sum of its children
    for an alt, the maximum of its children

That should allow us to get rid of ev_alloc altogether, since we would
know in advance how many evlis entries to allocate. We could also use a
smaller type for ev_valid. Maybe.

2010-01-09
----------

From TODO; now done!

Complete the hashing regime

    We need to select a suitable size for the hash table. I'd like
    instrumentation too.

2010-01-08
----------

D'oh! Stupid, stupid off-by-one error, took ages to spot. Now fixed, and
the integrated core/col list yields 545 bytes of heap per input byte, or
with _FILE_OFFSET_BITS=64, 787 bytes.

I'm not sure that there's much more squeezing for space to be done at
the moment. I should, of course, get hold of Ford's Java parsers, and
see how they measure up. I strongly suspect that he doesn't actually
build a parse tree, which would of course save loads of space.

2010-01-07
----------

Looking at, thinking about it some more. We need: the cores list, using
2 bits for discrimination and 2 bits to indicate how many additional
columns are pushed. (We will define macros to access the parts.) So
we're down to 28 bits for ids, but that still allows 268m nodes in a
tree!

OK, got that going. (I haven't actually defined macros to access the
parts; perhaps we don't need to.) Next let's remove the thrs stuff.
Done, and all warnings tidied up. So we now have:

==17129== malloc/free: in use at exit: 1,230,881 bytes in 30,883 blocks.
==17129== malloc/free: 44,716 allocs, 13,833 frees, 1,997,329 bytes allocated.

Which, since pacc.pacc is still 2655 bytes, means that we are now down
to 752 bytes of memory per input byte. Pretty impressive improvement,
but I just checked and Ford achieves 441 (pappy) and 297 (hand
generated) for his Java parsers, so we're still using way too much
space. Onwards...

Here's a thought. Since we have decided that the space for rule ids is
(long - 4bits), we could use the rule in struct intermed to encode the
status. Sigh. OK, this yields 694.

And although separating cores and cols looks promising, it means that
we've got *4* size_t counters in the struct. This means that the union
version would probably be more efficient: we might lose 4 bytes every
time we save a core (assuming long = 32 and off_t = 64 bits), but we'll
save 8 bytes in every struct intermed. There'll also be less wastage in
allocated but unused core/col list entries. Better try it.

2010-01-06
----------

So the new-style encoding of the call/eval list is working, in that the
evaluator is running off the new-style lists. Currently something else
(bindings, I think) still requires the old-style.

Eugh. The bindings stuff is pretty grody, since we've broken the link
between cols and cores. Let me see... we find the position on the stack
of the column by counting how many bindings there are. OK, OK... it
should work now, but removing the old-style pushing breaks it. I wonder
why?

2010-01-05
----------

Happy New Year, pacc! 

Picking up on the last comment, surely the whole point of the --feeder
function is that pacc *does* tie everything together into a single
parser.

Here's a one-liner to output some places to start searching for primes
(unfortunately I don't have primegen handy here, so I'll continue this
one at the shop).

perl -le '$x=0;while(($e=int(exp($x/1)))<2**32){print $e," ",100+$e;++$x}'

This only produces 23 numbers, which I don't think is quite enough.
Change the "1" to "1.8", say, for more numbers.

Anyway, back to the more efficient encoding of thunk/rules. The current
struct thunkrule contains a discriminator, an int (which is either a
thunk number or a rule number), and a column (which is only meaningful
if the item is a rule call). Obviously we can push rules and columns
separately, but columns are really of type off_t, not int. I do think
it's desparately important to use 64 bit offsets, otherwise we can't
parse any input larger than 4G. (On the other hand, we can't currently
parse anything that won't fit into memory, so that limits us to 4G on a
32 bit machine already. But maybe one day we have to lift that
restriction.)

Whilst poking around in /usr/include/bits/types.h, I notice that "in all
the GCC configurations we [glibc] support: long long is always 64 bits,
long is always word/address size, and int is always 32 bits." This makes
me think that many of the current uses of int should be long.

Anyway, how to deal with two potentially different sizes of thunk/rule?
We could union them up, I suppose, although the idea of using the bottom
2 bits to discriminate types of different sizes ain't gonna work. OK,
here's the most compact representation I can think of: there is a long
on the stack; if the bottom 2 bits indicate that it is a call, the next
thing on the stack is a column, of type off_t. But we can't express that
in portable C. If we union up the two types, the union type is as large
as the larger of its members.

OK, so it's a bit strong to say "can't express in portable C". At the
bottom level, we *could* have an unsigned char array, and manipulate it
directly with much use of sizeof etc. But then we have to worry about
alignment issues instead. So scrap that.

Here's yet another thought. The struct thrs array is used as a FIFO: we
always push new values onto one end and always read linearly from the
front. So we could simply have two arrays, one of longs and one of
off_ts. Read first element of the long array: if it's a call, read the
first element of the off_t array. Yes, that should do it.

2009-12-24
----------

I've started to implement the "partialization hack". On reflection, I
think I'm going to refer to this as "feeding", so for example we might
say "pacc --feed-rule __ --feeder catline ...", where "--feeder"
specifies the function that will be called to get another line of input
from the user. But I need to be sure what the interface to all this will
look like: bearing in mind that there are two different parsers
involved. I kinda think that pacc should provide the code to tie them
together, but implementing this might be a little way off. Be nice to
make it work, though, because I could then start playing with the es
grammar as another client for pacc.

On the same lines, I could quickly whip up an approximate XML parser in
pacc. Of course, baf suggests that XML is the sort of thing that pacc
isn't particularly good at (not much syntax, lots of data), but for
precisely those reasons it would make for some interesting test cases.

2009-12-21
----------

So hashing on rule id works! And indeed, it does appear slightly
advantageous to use non-contiguous IDs: the longest hash chain is now 21
items (was 22), and we're down to 7369 bytes per input byte. (But is
that just noise, really?)

There's some code in emit.c that looks quadratic, searching for rules by
name. I'm sure that can be improved.

2009-12-20
----------

Right, so we have working emit tests. I think I've got some insight into
why we need thunk stacks in intermed results: suppose we have a grammar
like S <- A B / A C. Having parsed an A at column 0, and worked out
which thunks we need to call to evaluate it, we then look for a B. If we
don't find it, the alt code must unwind everything we've discovered so
far, and then try again to match an A at column 0. So along with the
cached result that there is indeed an A at col 0, we need to cache what
thunks must be stored to evaluate it. (Obviously in this case we could
simply left factor the rule into S <- A (B / C), but in general this
doesn't solve the problem.)

And it's not obvious that we can easily manage the memory any better
than we currently do, with a dynamic array per struct intermed. Hmm...

One way that we could make the thunk list smaller is to use distinct
numbers for thunks and rules. In particular, if we start numbering
thunks from n_rules, then anything < n_rules must be a rule, anything
greater must be a thunk. That almost eliminates the need for the
discriminator, except that I've just remembered that we currently
distinguish calls that are / are not bound to a variable. Hmm...

Anyway, one thing to look at very soon: there are some promises in
emit.c that lookup_rule can go away when we hash. Now that we hash, is
that true? Yes: the idea was to hash not on the rule number, but simply
its id. Under hashing, it doesn't matter - in fact it's probably an
advantage! - that rule ids are large and non-contiguous.

2009-12-19
----------

Oh bugger. I really need to review how the thunk stack is built. But to
do that, we're going to have to make the emit tests work again. OK...

2009-12-17
----------

So hashing fundamentally works. We now have:

==7607== malloc/free: in use at exit: 11,534,549 bytes in 5,898 blocks.
==7607== malloc/free: 8,508 allocs, 2,610 frees, 20,116,713 bytes allocated.

That's a distinct improvement, although still a rather enormous 7576
bytes per input byte. Oh well, we still have a thunk stack in each
struct intermed.

2009-12-16
----------

All's broken, and I started trying to get the emit tests back to fix it.
But although that might be a useful thing to do anyway, it's not
strictly necessary. What I can do is roll back to a version that worked
with the full matrix, make the emit.c changes I have to recalculate cur
and debug those, then reinstate hashing. Here goes...

OK, well there's a serious snag in the theory: pacc0 can't build
anything other than pacc1, and if the emit code is broken, that pacc1
doesn't work! So we really do need to get the emit tests back. But I've
managed to sort out at least part of the puzzle anyway. Even better, it
actually saves an entire push()!

And that now works in the mainline. Hmph: the pushm2() code does
actually work as is, because we never allocate new matrix entries while
evaluating. So we could leave it for now.

2009-12-14
----------

OK, so some hashing is in place, and it doesn't work! At least one
problem is that we have been in the habit of pushing "cur" addresses:
this is what pushm() and pushm2() do. But the way the hash buckets work,
struct intermed pointers are likely to be realloc()ed under our feet! So
we will instead have to push col/rule pairs, and call _pacc_result()
again when we pop them. Hmm... is this what we really want? Well, yes.
Unless we go for linked list hash buckets, this is inevitable. And I'm
not going to start implementing anything that mangles pointers on a
stack... had enough of that with es!

OK, so in emit.c, cur_rule always holds the current rule: we *don't*
need to push that, just col will do fine. I think.

2009-12-13
----------

Messing about with matrix names, and it occurs to me that I ought to
make explicit my goal: a pacc parser should be re-entrant. More
specifically, an application should be able to include several parsers
(one parser understands one grammar), and several instances of a parser.
Furthermore, although most applications will include only a single
parser, which will come from a single C source file (well, a .[ch]
pair), for applications that do include more than one parser, as much
code as possible that's shared between parsers should occur only once
(in a separate .[ch] pair). This means that template.sh will have to get
a lot more complicated!

Anyway, that's the goal. Now back to your regular schedule...

Of course, _pacc_result() is all very well, but shouldn't this code be
inlined some day?

Now, I'm pretty close to doing hashing; It Would Be Nice to have some
non-trivial examples other than pacc.pacc. Should we get baf's Java
parser, and make it work? On 'tother hand, there's plenty else to do,
and, y'know, there's going to be enormous memory savings just on
pacc.pacc. Do I really need to glory in just how wondrous they will be
on other examples? The current implementation can hardly be considered
serious when...

; wc pacc.pacc
 129  465 2655 pacc.pacc
; valgrind ./pacc2 pacc.pacc -o pacc3.c
...
==7661== malloc/free: in use at exit: 72,312,011 bytes in 4,488 blocks.
==7661== malloc/free: 4,496 allocs, 8 frees, 72,312,063 bytes allocated.


That's 27236 bytes of memory for every byte in the input file! I think
we can do better than that...

By the way, here's a valgrind command to take account of the way we're
using stdin:

valgrind '--db-command=gdb -nw %f %p < /dev/tty' '--db-attach=yes' '--input-fd=7' ./pacc1 pacc.pacc -o pacc2.c <[7] /dev/tty

2009-12-11
----------

Interesting. So, I have made the same problem pop up somewhere else. Let
me review this...

Make s_alt do nothing clever, so alts are always nested. When the Rule
result reads "s_kid(alt, cons(s,r))", with 5 names in the code, we have
the problem. Change the result to "s_alt(s, r)", with just 3 names, and
it all works. Add in "/* x y */" and it fails. With 4 names or 6 names
in scope, it all works; with 5 it fails.

Now, the Rule rule is *much* simpler than the Matcher rule. In fact, it
doesn't *have* nested alts. Hmm... "s_alt(s, r) /* x y */" fails when
s_alt creates nested alts. When s_alt denests alts, it succeeds. (Here,
"it" means "make check", that is, pacc2 can parse its own grammar and
build the phase 3 compiler.)

Proposal: I can make this problem go away by denesting alts, which is
the Right Thing anyway (although it probably ought to live in the
optimization pass, rather than initial tree building). *All* the code in
emit.c needs a severe reviewing, and I may just happen to solve the
problem anyway. And I've got a few new tests up my sleeve to try and
elicit bogus binding behaviour. Let's try those...

2009-12-10
----------

One of those utterly baffling bugs. In both pacc0.c and pacc.pacc, a
bound call is constructed by "s_bind(n, u)". Turns out that s_bind() is
not declared, so this produces integer / pointer warnings, but works.
Replace that call, in pacc.pacc, with "s_both(bind, n, u)", which
appears to be completely equivalent, and the resultant pacc2 just
doesn't work: it fails to parse even the simplest grammar.

It's hard to see what's wrong, or even different, about this pacc2. I
wonder if there is some shared structure somewhere? Or am I wrong about
the two calls being equivalent? Next step: rewrite s_bind to use s_both.

OK, so now I have s_bind() defined as "return s_both(bind, n, c)", and
declared in syntax.h. Every works. "Inline" the call to s_both into
pacc.pacc, and it falls over, in just the same way that it did before.
What on earth is going on?

I still have absolutely no idea what is going on, but I have narrowed
the problem down to the following. The call to "s_both(bind, n, u)"
brings 4 names into scope (s_both, bind, n, and u), whereas "s_bind(n,
u)" only 3. Any time we have 4 names, even with something like
"s_bind(n, u) /* x */" the resultant parser is broken in exactly the
same way. First guess was that it was something to do with stack
overflow in emit.c, but apparently not. (We do have a lot of stacks, but
most of them now indicate when they overflow.)

Another data point: adding *more* identifiers ("s_bind(n, u) /* x y z
*/") also makes the problem go away, so it's not merely something
overflowing, and in any case I believe that all stacks are now overflow
protected. The mighty valgrind doesn't report any problems.

I'm still baffled.

OK. I believe that the problem is as follows. Something about the syntax
tree we build for the Matcher rule causes us to emit code that is
faulty, and somehow fails to recognise valid inputs. If this is so, then
making the s_bind -> s_both change in pacc0.c should result in a pacc1
that fails the same way pacc2 now does. Try it... No, because the
salient property is the length of bind list; change *that* to include
extra names... 

No, this is not the case. When pacc0.c builds this tree:

        expr 371: s_bind(n, u) /* x */ 
          ident 370: s_bind 
          ident 369: n 
          ident 368: u 
          ident 367: x 

and pacc1 parses pacc.pacc to produce this tree:

                  expr 326:  s_bind(n, u) /* x */  
                    ident 325: s_bind 
                    ident 324: n 
                    ident 323: u 
                    ident 322: x 

This implies that the problem is a bizarre interaction between the
nested alt nodes, and the bind count. Now, at some point nested alts
should go away... shall we do this? But I'd also like to understand why
they fail in this case (if, indeed, making them go away fixes the
problem). Hah...

OK, denesting alts makes the s_both() code work. What about renesting
them, by removing the conditional case in s_alt()? Err, no, *that* also
works fine. Hang about...

Restoring the old nested alts code in pacc.pacc brings back the
"expected Char at column 55". Brain fade. Good night.

2009-12-07
----------

I'm starting to implement hashing, and have been reading up on hash
chains, open probing, double hashing, radix searching, patricia and so
on. It occurs to me that in the current interface we always know the
input length before we start parsing, and the number of rules. So
assuming we can guess the "matrix density" fairly sensibly, we've got a
pretty good idea exactly how many slots we will need. So it should be
possible to avoid rehashing.

(There might be an option to generate a parser which doesn't know its
input size up front, in which case rehashing would be necessary. But
let's worry about that another time.)

Not many data points so far; basically pacc itself is the only serious
grammar we have. At present, that allocates 143154 results, and used
16426 of them, or 11.5%. Looking at that and a couple of the larger test
cases suggests that 10% - 15% is about right for the matrix density. So
that gives us a few options.

1. Using hash chains, we can say that we will never need to rehash; the
worst that can possibly happen is that the chains get a bit longer than
we'd like. Suppose we have (rules * input length / 100) chains, then the
average length of chains equals the percentage density; about 12 items
per chain. We'd want around 1500 hash buckets for pacc.pacc. The only
disadvantage is that we need a next pointer in every result.

2. If we double hash, although we can size the table so that it *should*
be big enough, we can't guarantee that. We save on the next pointer, but
we will need to be prepared to rehash if the table gets more than about
70% full. Hmm...  and the initial estimate will be big. We'd maybe go
for (rules * input length * 0.17), so for pacc.pacc we'd allocate about
24000 slots. The wastage there kinda dwarfs a few next pointers, but on
the other hand we never have to allocate any memory.

In either case, we have the interesting problem of picking a size for
the hash table. Ideally, of course, it should be prime, which makes me
think of testing for primality on the fly, or (better) starting with a
lookup table of a few dozen primes.

2009-12-04
----------

Template file done as described, and aren't I glad I did! The best part
is that, because it's (almost) a real C file, vim's syntax highlighting
works, unlike emit.c where all the emitted code is just pink. (Now that
gives me an idea, what about a vim syntax definition that can understand
embedded code... :-)

Anycase, I'm just redoing the test harness so that it can make use of
errors, and it occurs to me that we need to generate useful errors for
failing predicates.

2009-12-03
----------

OK, so I'm starting to get error handling working, and it's actually
rather pretty! (The interface, not the implementation, yet.) When
harness is a baf3 parser, we get:

    ; ./harness '3+4*('
    expected Additive at column 5

However, I'm not entirely happy with this:

    ; ./harness '3+4*(5'
    expected Right at column 6

I would prefer the error to read like this.

    expected ")" at column 6

And this error is actually generated initially (by the failing lit
matcher), but then over-written when pacc discovers that the Right rule
(which is just Right <- ")" Space) has made no progress at all. To do
this, we'd either need a heuristic to identify when rules should
over-write error information, or a way to annotate such rules in the
grammar. (Or both!)

Heuristics might include: capitalization of the rule name (but I really
hate this); whether the rule contains any alternatives.

Annotations look more flexible, but I don't know what syntax I'd like to
use. Straw man: something like this?

   D <- "0" { 0 } / "1" { 1 } / ... / "9" { 9 } ~ "decimal digit"

Just at the moment, I can't see how Pappy handles errors at all, and I
don't have the Rats! documentation to hand.

Any case, at the moment, I'm not going to make any radical changes,
except to clean up the code.

OK, so now I'm thinking about getting rid of the -part.c bodge, and have
decided that we really must have a template file (rather than emit.c
sprouting a new rash of printf()s). So how would that work? Suppose we
have pacc.tmpl, with some kinda marker at the points where we currently
include pacc-part.c. Then a sed script could create template.c, which
holds the three functions pre_decl(), pre_engine(), and post_engine().
Let's do it!

2009-11-27
----------

OK, so I worked out why pacc2 and pacc3 were different! (A bug in
pacc.pacc.)

The interface has already changed, and I'm starting to think
that we probably *do* want a "-d" ("--defines"!?) option that also
writes a .h file. The current interface to the parse function is:

    int parse(char *, off_t, <type of start rule> *)

(The appearance of off_t is a bit of a nuisance, since it means that we
depend on <sys/types.h>, but it *is* correct.)

Anycase, the TODO list has moved to (you guessed it) TODO.

2009-11-25
----------

Here are a few interesting statistics:

    ; make clean; time make
    ...
    mv pacc2 pacc
    2.10user 0.52system 0:03.88elapsed 67%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+4664outputs (0major+57028minor)pagefaults 0swaps

    ; time ./pacc1 ``() { cat pacc.pacc } >/dev/null
    ...
    0.05user 0.17system 0:00.34elapsed 63%CPU (0avgtext+0avgdata 0maxresident)k
    0inputs+0outputs (0major+17741minor)pagefaults 0swaps

    ; wc pacc.pacc 
     129  460 2643 pacc.pacc
    ; wc pacc2.c
      6334  28437 193367 pacc2.c

I can't think of anything much else to measure at the moment.

So that brings us to the command line interface. Obviously enough,

    pacc pacc.pacc

should read pacc.pacc and write a parser to pacc.c. The "-o" option can
specify a different output file.  Just at present, I can't see any
reason for a separate .h file, since the interface is generally
predictable: int parse(<type of start rule> *, const char *).

2009-11-24
----------

Where next? Here are some things that are outstanding, in no particular
order.

1. Work out what the command-line interface to pacc is
2. Replace the intermediate results matrix with hashing
3. Review of the parsing engine - do we really need so many stacks?
4. New test cases and harness
5. Autoconfiscation
6. Only one instance of each type per union
7. Handle void types
8. Dynamically allocate everything
9. Parser optimization / manipulation (including partialization hack)
10. Tidy up tracing
11. Instrument and measure current time / space performance
12. UTF-8 support

Just off the top of my head, the most urgent would be 10, 11, 1, and 4.

OK, and I have tidied up tracing, at least to the point where it's
possible to remove all the sed hacks from the Makefile.

2009-11-13
----------

Friday the Thirteenth, and the loop is tied! Specifically, paccman.c
builds a parser (stage1) which can parse pacc.pacc and build another
parser (stage2), which can parse pacc.pacc and build another parser
(stage3), which can... As you would expect, stage2 and stage3 are
identical. A brief eyeball suggests that the stage1 parser differs only
in id numbers (since the tree is built in a slightly different order).

2009-11-05
----------

It's funny how even the trivialest aspects require design decisions
(which I hope to get "right": partly to save time later, but mostly just
out of a sense of elegance). For example, it's just become crystal clear
exactly how identical "expr" and "guard" nodes are in the tree: they
both hold a hunk of code in the node itself, and a list of identifiers
used in that code as child. But we can't distinguish them by context
(both are typically children of a seq).

So I think I can see s_retype() about to be created...

2009-10-14
----------

Interesting, becoming an expert in a system which doesn't yet exist!
I've been trying (once again) to get my head around how we can tell
where a rule ends. For quite a while, I was under the mistaken
impression that Ford can parse "a <- b c <- d" because the parser backs
up when it sees the second arrow, and decides that the "c" must belong
to a new rule. This isn't how PEGs work!

In fact, Ford can tell where a rule ends, because every rule *ends* with
an expression "a <- b -> foo c <- d -> bar". Since only one thing
appears to the right of a right arrow, the "c" must start a new rule.
OK.

D'oh! And Piumarta solves it so easily, by saying that an identifier
that is a call is "identifier !EQUAL". Guess who'd forgotten about
syntactic predicates?

2009-10-08
----------

I seem to have all the above worked out, and almost working. I'm about
to wave a wand of simplification over the thrs array. Here goes...

2009-10-03
----------

OK, so the problem is with getting the right column for bindings.
Suppose we decide that the parse we are working on will need to evaluate
a thunk, which binds names "a" and "c"... we need to push onto the
thr_stack the columns at which those names were bound. (Note that we
must be prepared to undo this if we abandon this parse, but I think
that's handled fine already by restoring thrs_ptr.)

So, when we see the binding of "a", we need to record the current
column. This can't simply go straight onto the stack already, because we
don't know what thunk(s) are coming up, nor yet what names they will
need. (What about "b", which was bound, but is unused in a given thunk?)

Currently, there is a single list of names encountered and the columns
at which they occur (oh, well, actually two coupled lists). That might
work, but we will need to save / restore its pointer along with all the
others. Hmm... let's have another look.

No, it's hairier than that. What col we are at is dynamic (in the
parser), whereas what names are bound is static (in the
parser-generator).

2009-10-02
----------

Ugly hacking session, all is confusion. I have a feeling that the
intermediate results matrix is built funny. I don't really care, since
it's going away soon, but it doesn't help my head!

2009-09-30
----------

Major progress on the parser, but I've run into a tricky problem with
bind / guard. When a name is bound, bind_pre() calls pushcol() to record
the column where it will be bound. Then emit_expr() popcol()s each
column from col_stack to do the actual binding. Note that emit_expr()
assumes that all names in scope will be bound.

Guards, on the other hand, declare which identifiers they will bind, and
guard_pre() uses this information to bind only those, *without* calling
popcol(). It finds the appropriate column from the thrs stack. Needless
to say, this means that there are more pushes than pops of the
col_stack, and hilarity ensues.

Obviously the right fix is to make exprs follow guards, and declare
which identifiers they need bound. (Apart from anything else, what
happens if you have two exprs in a seq?)

This is easy enough to do, but the hard part is that almost every test
case, and certainly every rule in paccman.c will need to be rewritten.
So, just for a temporary workaround, I've made guard_pre() call popcol()
once for each binding.

2009-09-17
----------

OK, so there's been a bit of a hiatus, but now I'm back on the case. The
file paccman.c contains code to build, manually, a tree that will
represent a parser for PEG itself. This needs to be copied to mktree.c
(and there is also a slightly customised foo.c to go with it for now).

Call the (fully-featured) language that pacc will understand tPEG. Call
a simplified version of this sPEG. Then the code in paccman.c builds a
parser for sPEG. 

We will also need a definition of tPEG written in sPEG. This will be
called, say, pacc.pacc. The parser built by paccman.c will read
pacc.pacc, and build a parser for tPEG.

What, if any, are the differences between sPEG and tPEG? Obvious
candidates include constructs that will eventually be desugared, such as
repetitions. And where there are alternate syntaxes, for example "←" or
"<-", sPEG need only support one alternative. And sPEG need not
understand defaults, such as default types, nor comments.

So it's going to be useful actually to *define* sPEG and tPEG. OK.

There's some other stuff going on too. In syntax.c we are starting to
acquire tree builders (with unsatisfactory names like mkcall(); would
not s_new_call() or s_call() do as well, if not better?). These are
necessary because we have semantic *expressions*, not actions, in pacc.
So for example, the matcher that calls another rule definition is
expressed like this.

    struct s_node *SeqRule ← i:Identifier { mkcall(i) }

Thing is, these tree builders are mighty useful. We ought to use them in
paccman.c itself. In fact, we ought to rewrite *all* the examples and
test cases we have so far to utilize the tree builders.

Or maybe not. After all, they need to be convenient to express trees as
built by pacc, not manually. Let's get the parser going first, then see
if we want to rewrite anything!

2009-07-24
----------

Well, repetitions work as advertised. However, I believe that although
the direct implementation is very straightforward, we must in fact
desugar repetitions into alts and seqs. If not, we risk the parser
becoming non-linear. (I need to cook up a test case which does in fact
show non-linearity here.) Basically the point is that only an entire
rule is memoized; if repetitions don't utilize rules, they are not
memoized. I'm inclined to leave the working repetition code for the time
being, though.

2009-07-07
----------

Repetition: star, plus, query. I'm minded to implement a generic
repetition operator, that matches any number of reps between a given min
and max (with max = 0 implying infinity). Thus query is rep(0,1), plus
is rep(1,0), and star is rep(0,0). (Interesting that this covers 3 out
of four quadrants: the fourth - rep(1,1) - is simply a normal match.)

This would also allow rep(2,3), for instance, although we currently
don't have a syntax for that. Errm, I guess "x @ 2 3" would do at least
as a straw man.

Now, the only possible hitch here is that we now need a s_node of type
"rep" to hold two integers. Do we just add them into struct? Or union
them with char *text? Or have some kinda "number" s_type, that we can
hang off "rep", like "call"s hang off "bind"s?

2009-06-30
----------

Blimey. Here's something really mad. If we keep the number of specials
in the grammar really small, then instead of this::

     int Additive <- m:Multitive PLUS a:Additive { m + a } / Multitive

you could instead say this::

     int Additive <- m:Multitive + a:Additive { m + a } / Multitive

This is a really groovy idea, but we almost certainly do need at least
the characters ?*+.(){}/<>-"!: which doesn't leave many. Darn.

2009-06-27
----------

A showstopper?

I'm starting to close the loop, since pacc is almost clever enough to
emit a parser that can read its own grammar. However...

    char *Item ← s:Whatever { n = s_new(rule); n->text = s; n; }

The idea of semantic expressions starts to look a bit murky.

We could just say that we don't care: you must define functions like

    struct s_node *new_rule(char *t) {
	struct s_node *r = s_new(rule);
	if (r) r->text = t;
    }

    char *Item ← s:Whatever { new_rule(s) }

Is that too draconian?

2009-06-26
----------

OK, what I have right now is a "match()" macro, which returns the text
matched by the entire rule. It's kinda groovy, but considering cases
like this:

  char *Ident ← IdentStart IdentCont* { match() } Spacing
  void Spacing ← (' ' / '\t')*

it's a pity that match() will return the Spacing too.

(By the way, I think it would be insane to have anything like match() as
a default! It allocates memory...)

It would be easy to store the column where an expression is encountered
so that the above would work. (I was considering lmatch(), rmatch(), and
match(), which respectively match everything before, after the
expression; or the entire text matched by the rule. In fact, it would be
simpler just to have match() and rmatch(), which match to the left and
right respectively; if you want the entire expression, put match() at
the end - or rmatch() at the beginning.)

Arguably, Piumarta's bra & ket syntax is even more flexible, and there
are times when that flexibility is warranted.

  char *String ← '"' < [^"]* > '"' Spacing

(Note that this doesn't allow a string to contain " characters.)

  char *String ← '"' d:StrEsc '"' Spacing → d
  StrEsc ← ( '\\' . / . )* { match() }

Combination of match() with stashing col_expr handles all the simple
cases. And particularly if we allow binding to anything (by generating
extra rules), you can get this, which is barely even any longer.

  char *String ← '"' d:([^"]*) '"' Spacing → d

Here's a wacky one: why not phrase that as

  char *String ← '"' =:([^"]*) '"' Spacing

Oh dear, oh dear. Stop it now. Make match() work as you said you could,
and move on. We can always revisit this later.

OK, match() means lmatch(), and rmatch() exists. Cool.

Here's a bit of a puzzle that I really don't know the answer to. First,
is it permitted to have more than one expression per rule? For example,
what about something like this.

  char *Concat ← Word { a = match() } '^' { b = rmatch() } Word { strcat(a,b) }

Well, no, it certainly will never be possible to bind names in an
expression. At best, all bar the rightmost expression are evaluated - as
expressions - in a void context, as if left-hand operands of ",".

Anycase, the obvious thing to do at this stage is to say that there can
only be one expression per rule. Or rather, one active expression per
rule: we need this to work.

  int Digit ← '5' { 5 } / '6' { 6 }

So the top-level of a rule can be an alt, in which case each leg of the
alt can have its own expr. But what about non-top-level alts?

  <example here>

2009-06-17
----------

Pressing on with semantic expressions. Each rule has a type (worry about
defaulting later).

Rules of type "void" never yield a value, of course. Should we allow
expressions that are evaluated for their side effects? I guess so.
Clearly it is not legal to bind a name to a "void" rule.

Rules of most other types are fine: the semantic expression must yield a
value of the given type:

  int Additive <- m:Multitive '+' a:Additive { m + a }

But we need at least something else: a way to extract parts of the input
text. A rule of type [ haven't decided: either "char *", or some special
type like "text" ] has a default value (in the absence of a semantic
expression) of [ haven't decided: either the entire text matched by the
rule, or the concatenation of the parts of it that aren't matched by a
"void" call, or ... ]. This default value is freshly allocated by a call
to realloc() - to use your own memory allocator, simply #define realloc
in the raw code at the top of the file.

And the first decision is instantly made: since this is only a default,
certainly we should use "char *", and not something special. Hmm... 

  char *Name <- Letter LetterOrDigit*

The value is the text matched. We've said that in general a missing
expression is an error, unless the only thing happening here is a call,
in which case the value is simply propagated. So that normally

  A <- B

means

  A <- b:B -> b

But these two defaults can collide. In this grammar:

  char *Name <- NamePlus
  char *NamePlus <- l:Letter LetterOrDigit* -> l

does Name get the value of all the text it matches, or the value of the
solitary call? Ways out: 1) introduce "text" and say that solitary call
is indeed the default for all types except "text", which otherwise
behaves identically to "char *". 2) Say that, for "char *" only, matched
text beats solitary call: write "char *Name <- np:NamePlus -> np" when
that's what you want. 3) Do away with one of the defaults altogether:
3a) just say that solitary call is not a default; 3b) say that we use
Piumarta-style bra kets to mark the matched text. 4) Provide a function
/ macro that means "all the matched text".

I don't like any of these!

Source Files
============

Since pacc is partly written by pacc, it can all get a bit confusing!

emit.c - writes the output file from the AST

main.c - command line parsing, other glue

pacc.c - the packrat parser that constructs an AST from a .peg file;
constructed (by pacc) from pacc.peg

paccman.c - constructs an AST manually for bootstrapping

pacc.peg - the grammar description that is converted to pacc.c

syntax.c - support for the Abstract Syntax Tree constructed as the
grammar is read


Other Notes
===========

The key to (linear) packrat parsing is the memoization table that
squirrels away every result that the parser discovers about the input as
it is read. In a naive implementation, this is simply a matrix, with one
row for each rule of the grammar, and one column for each character of
the input. However, this matrix is (typically) very sparse, so pacc
tries to do better.

thoughts...

rows is of the order of 100 - it would be a pretty zany grammar that had
more than 1000 rules = ~7 bits

columns is of the order of 100000 - we should support inputs of 1
megabyte (or 10 or even 100 if possible), but most inputs will be *much*
smaller than that = ~16 bits

(should we consider swapping the memoization table ourselves, like sam?
or just bung it all in vm and hope for the best?)

so, thinking about a hash function for the matrix, we have about 23
bits of input, in two integers. that should be enough to do something
interesting...

---

Insight: we talk about semantic *actions*, and in ``yacc`` they are ``C``
program fragments, but really those things are expressions! In ``yacc``,
of course, the convention is that the program fragment delivers a value
by assigning to ``$$``. So these things are not really actions at all, but
*expressions*. They have values, and types.

How might we express this?

int Decimal <- d:'0' / d:'1' / ... { d - '0' }
int Decimal <- < [0-9] > { d - '0' } ???
struct Node *cmd <- ...
                 /  TWIDDLE optcaret s:word p:words { mk(nMatch, s, p) }
                 / ... 

Problems: braces around expressions look wrong... didn't Ford have a
different / better syntax? OK, all sorts of stuff coming out here (I
don't like Piumarta's < > syntax; ...). However, the idea of typing each
rule seems plausible. And the type of the first rule can become the
default type.
